Identifying Satellites and Periodic Repetitions in Biological Sequences
Abstract
We present an algorithm for identifying satellites in DNA sequences. Satellites (simple, micro, or mini) are repeats, numbering between 30 and as many as 1,000,000, whose lengths vary between 2 and hundreds of base pairs and that appear, with some mutations, in tandem along the sequence. We concentrate here on short to moderately long (up to 30-40 base pairs) approximate tandem repeats where copies may differ by up to epsilon = 15-20% from a consensus model of the repeating unit (implying that individual units may differ by up to 2 epsilon from each other). The algorithm is composed of two parts. The first is a filter that eliminates all regions whose probability of containing a satellite is less than one in 10^4 when epsilon = 10%. The second performs an exhaustive exploration of the space of all possible models for the repeating units present in the sequence. It therefore has the advantage over previous work of being able to report a consensus model, say m, of the repeated unit as well as the span of the satellite. The first phase was designed for efficiency and takes only O(n) time, where n is the length of the sequence. The second phase was designed for sensitivity and takes O(n · N(e, k)) time in the worst case, where k is the length of the repeating unit m, e = [epsilon k] is the number of differences allowed between each repeat unit and the model m, and N(e, k) is the maximum number of words that are at most e differences away from a given word of length k. That is, N(e, k) is the maximum size of an e-neighborhood of a string of length k. Experiments reveal the second phase to be considerably faster in practice than the worst-case complexity bound suggests. Finally, the algorithm is easily adapted to finding tandem repeats in protein sequences and can be extended to identify mixed direct-inverse tandem repeats.
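To give a feel for how the e-neighborhood size N(e, k) in the worst-case bound grows, here is a minimal Python sketch. It is an illustration only, not the paper's method: it assumes that "differences" means substitutions only (Hamming distance), whereas the abstract does not say whether insertions and deletions also count, and the function names `hamming_neighborhood` and `hamming_neighborhood_size` are hypothetical. Under that assumption, the neighborhood of any word of length k over an alphabet of size s contains the sum over i = 0..e of C(k, i) · (s - 1)^i words.

```python
from itertools import combinations, product
from math import comb

DNA = "ACGT"  # assumed alphabet for the illustration


def hamming_neighborhood(word, e, alphabet=DNA):
    """Enumerate every word within e substitutions of `word`.

    Assumption: only substitutions are counted as differences.
    """
    neighbors = set()
    for d in range(e + 1):
        # Choose which d positions are changed...
        for positions in combinations(range(len(word)), d):
            # ...and which different letter goes into each of them.
            choices = [[c for c in alphabet if c != word[p]] for p in positions]
            for letters in product(*choices):
                mutated = list(word)
                for p, c in zip(positions, letters):
                    mutated[p] = c
                neighbors.add("".join(mutated))
    return neighbors


def hamming_neighborhood_size(k, e, sigma=4):
    """Closed-form size of the substitution-only e-neighborhood."""
    return sum(comb(k, i) * (sigma - 1) ** i for i in range(e + 1))


if __name__ == "__main__":
    # Brute-force enumeration and the closed form agree on a small case.
    print(len(hamming_neighborhood("ACGTAC", 1)), hamming_neighborhood_size(6, 1))
    # Growth with k when two substitutions are allowed.
    for k in (10, 20, 30):
        print(k, hamming_neighborhood_size(k, 2))
```

Even in this simplified setting the neighborhood grows quickly with k and e, which suggests why the second phase is costly in the worst case and why the linear-time filter is applied first.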
Similar resources
Computing the Repetitions in a Biological Weighted Sequence
One of the most important goals in computational molecular biology is allocating repeated patterns in nucleic or protein sequences, and identifying structural or functional motifs that are common to a set of such sequences. Although the problem of computing the repetitions in biological sequences has been extensively studied, in the relevant literature, the problem of computing the repetitions ...
Using a Genetic Algorithm for Detecting Repetitions in Biological Sequences
One of the most important goals in computational molecular biology is allocating repeated patterns in nucleic or protein sequences, and identifying structural or functional motifs that are common to a set of such sequences. In this paper we describe a new approach to detect the repetitions of fixed length in Biological Sequences using a Genetic Algorithm. The method involves evolving a populati...
Counting maximal-exponent factors in words
The topic of repeating segments in words is one of major interest in combinatorics on words. The topic has been studied for more than a century by many authors after the seminal work [9] which described infinite words containing no consecutive occurrences of the same factor. Beyond the theoretical aspect of questions related to redundancies in words, repetitions, also called repeats in the foll...
Lossless Filter for Long Multiple Repetitions with Edit Distance
Identifying local similarity between two or more sequences, or identifying repetitions occurring at least twice in a sequence, is an essential part in the analysis of biological sequences and of their phylogenetic relationship. Finding fragments that are conserved among several given sequences, or inside a unique sequence, while allowing for a certain number of insertions, deletions, and substi...
Accurate Prediction of the Statistics of Repetitions in Random Sequences: A Case Study in Archaea Genomes
Repetitive patterns in genomic sequences have a great biological significance and also algorithmic implications. Analytic combinatorics allow to derive formula for the expected length of repetitions in a random sequence. Asymptotic results, which generalize previous works on a binary alphabet, are easily computable. Simulations on random sequences show their accuracy. As an application, the sam...
Journal title:
- Journal of computational biology : a journal of computational molecular cell biology
Volume 5, Issue 3
Pages -
Publication date: 1998